Explainability - by Example

This section demonstrates, through practical examples, how explainability can be implemented for both traditional and generative AI models.

Linear Regression Model

Problem Statement & Solution

Problem Statement: Inaccurate and inconsistent valuation of residential real estate properties results in significant financial risks for investors, buyers, and sellers. Current valuation methods rely heavily on expert judgment and limited data, leading to potential overvaluation or undervaluation.

Solution: Develop a Linear Regression model to accurately predict future market values of residential real estate properties based on a comprehensive dataset of property and market characteristics.

Training Data Details

MedInc: Median income for households within a block of houses, in tens of thousands of US dollars [10k$]
HouseAge: Median age of a house within a block; a lower number means a newer building [years]
AveRooms: Average number of rooms per household within a block
AveBedrms: Average number of bedrooms per household within a block
Population: Total number of people residing within a block
AveOccup: Average number of household members
Latitude: How far north a house is; a higher value is farther north [°]
Longitude: How far west a house is; a higher value is farther west [°]
MedHouseVal: Median house value for households within a block, in US dollars [$] (prediction target)

Global explainability in ML models refers to the ability to understand and interpret the overall behavior and decision-making process of a machine learning model across all its predictions, rather than just individual instances. It provides insights into how different features contribute to the model's predictions on average. Following are a few model-agnostic techniques that identify important features influencing model predictions.

Kernel SHAP

Kernel SHAP analysis indicates that income and occupancy are the top features affecting the model's predictions. This means these factors play a significant role in explaining how the model arrives at its conclusions. In simpler terms, changes in income and occupancy levels have the largest impact on predicting the future market value of residential real estate properties.

[Figure: Kernel SHAP global feature importance for the house-value model]
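A minimal sketch of how such a Kernel SHAP analysis can be produced with the `shap` package on the California Housing dataset (which matches the feature table above). The split and sample sizes here are illustrative choices, not the exact configuration behind the figure.

```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset described in the feature table and fit the model.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# KernelExplainer is model-agnostic: it only needs a predict function and a
# background sample used to marginalize out "absent" features.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_test.iloc[:50])

# Global view: mean |SHAP value| per feature across the explained instances.
shap.summary_plot(shap_values, X_test.iloc[:50], plot_type="bar")
```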

Local explainability in ML models refers to the ability to understand and interpret the decision-making process of a model for specific instances or predictions. Unlike global explainability, which focuses on the overall behavior of the model, local explainability provides insights into how different features contribute to a model's prediction for a particular input. This helps users understand why the model made a specific decision and allows for greater trust and transparency in the model's outputs.

LIME (Local Interpretable Model-agnostic Explanations)

LIME analysis shows that income, location, and occupancy are the most influential features for our model. This indicates that changes in these factors significantly impact the predictions for residential real estate market values. In simpler terms, variations in income, where a property is located, and how many people live in it play a crucial role in determining its future value.

[Figure: LIME explanation for a single house-value prediction]
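A comparable LIME explanation for a single prediction, reusing the model and data from the Kernel SHAP sketch above; `num_features=5` is an illustrative choice.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    mode="regression",
)

# Explain one test instance: which features pushed this particular
# prediction up or down, and by how much.
explanation = explainer.explain_instance(
    X_test.iloc[0].values, model.predict, num_features=5
)
print(explanation.as_list())
```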

Other Explainer Methods

SHAP (SHapley Additive exPlanations)

SHAP analysis reveals that income, location, and occupancy have the highest feature importances. This suggests these features significantly contribute to explaining the model's predictions. In simpler terms, variations in these features have the greatest influence on predicting the future market value of residential real estate properties.

[Figure: SHAP feature importance for the house-value model]
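For a linear model, SHAP values can also be computed exactly with a model-specific explainer, which is much faster than Kernel SHAP; a minimal sketch, again reusing the model and data from the first example.

```python
# LinearExplainer exploits the model's linear form instead of sampling.
explainer = shap.LinearExplainer(model, X_train)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```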

Feature Importance

Permutation Importance

By randomly shuffling the values of one feature at a time, permutation importance measures the impact on model performance. The high importance of income, location, and occupancy indicates that shuffling their values significantly degrades the model's predictions, which suggests the model relies on these features to make accurate decisions.

[Figure: Permutation importance for the house-value model]
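A minimal sketch of permutation importance with scikit-learn, reusing the fitted model and held-out data from the earlier examples.

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature n_repeats times and record the average drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X_test.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```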

Variable Importance

Partial Dependence Variance

Decomposing the model's predictions by income, location, house age, and occupancy reveals high partial dependence variance for these features. This signifies that the model's average prediction varies substantially as their values change; in simpler terms, each of these features independently exerts a strong influence on the model's output.

[Figure: Partial dependence variance for the house-value model]
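One simple formulation of this technique (following Greenwell et al.) scores each feature by the spread of its partial dependence curve. The sketch below uses scikit-learn's `partial_dependence` with the model and data from the earlier examples, and is not necessarily the exact estimator behind the figure.

```python
from sklearn.inspection import partial_dependence

for i, name in enumerate(X_train.columns):
    # Average prediction as a function of feature i, marginalizing the rest.
    pd_result = partial_dependence(model, X_train, features=[i])
    curve = pd_result["average"][0]
    # A flat curve means low influence; a widely varying curve means high.
    print(f"{name}: std of partial dependence = {curve.std():.4f}")
```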

Binary Classifier Model

Problem Statement & Solution

Problem Statement: The problem at hand is to accurately predict whether an individual is at risk of developing heart disease. This prediction can be made by analyzing various health-related factors, demographic information, and medical history.

Solution: Develop a Binary Classification model to accurately predict individuals who are at risk of heart disease so that preventive measures can be implemented to improve their health outcomes.

Training Data Details

age: Age of the individual in years.
sex: Gender of the individual (1 = male, 0 = female).
chest_pain_type: Type of chest pain experienced (categorical, 0-3).
resting_blood_pressure: Resting blood pressure (in mm Hg) measured when the individual is at rest.
serum_cholesterol: Serum cholesterol level (in mg/dl).
fasting_blood_sugar: Fasting blood sugar level (1 if > 120 mg/dl, otherwise 0).
resting_ecg_results: Resting electrocardiographic results (categorical, 0-2).
max_heart_rate_achieved: Maximum heart rate achieved during exercise (in bpm).
exercise_induced_angina: Exercise-induced angina (1 = yes, 0 = no).
oldpeak: ST depression induced by exercise relative to rest.
slope: Slope of the peak exercise ST segment (categorical, 0-2).
number_of_vessels_fluro: Number of major vessels (0-3) colored by fluoroscopy.
thalassemia: Thalassemia status (1 = normal, 2 = fixed defect, 3 = reversible defect).
is_disease: Denotes whether the individual has heart disease (1 = yes, 0 = no) (prediction target).

Global explainability, as described earlier, characterizes how different features contribute to the model's predictions on average, across all instances. The following model-agnostic techniques identify the features that most influence this classifier's predictions.

Kernel SHAP

Global Kernel SHAP analysis reveals that sex, chest pain type, and the number of vessels colored by fluoroscopy are the most influential features across all predictions in our heart disease prediction model. This indicates that these factors consistently affect the risk assessment for heart disease across the entire dataset. In simpler terms, the overall patterns suggest that a person's gender, the type of chest pain they experience, and the number of major vessels affected are key indicators in evaluating heart disease risk.

[Figure: Kernel SHAP global feature importance for the heart disease model]
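A hedged sketch of how this global Kernel SHAP analysis could be reproduced. Here `clf`, `X_train`, and `X_test` are assumptions: a fitted binary classifier with `predict_proba` and train/test frames built from the feature table above (no training code is shown in this section).

```python
import shap

def predict_disease(data):
    # Explain the probability of heart disease (class 1).
    return clf.predict_proba(data)[:, 1]

background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(predict_disease, background)
shap_values = explainer.shap_values(X_test.iloc[:50])
shap.summary_plot(shap_values, X_test.iloc[:50], plot_type="bar")
```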

Local explainability, as described earlier, explains how features contribute to the model's prediction for a particular input, supporting trust and transparency in individual decisions.

LIME (Local Interpretable Model-agnostic Explanations)

LIME analysis reveals that sex, chest pain type, and the number of vessels colored by fluoroscopy are the most influential features for our heart disease prediction model. This suggests that changes in these factors significantly impact the model's predictions regarding an individual's risk of developing heart disease. In simpler terms, variations in a person's gender, the type of chest pain they experience, and how many major vessels are affected are critical in assessing their likelihood of having heart disease. This highlights the importance of these features in understanding individual risk profiles.

[Figure: LIME explanation for a single heart disease prediction]
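The corresponding LIME sketch for a single patient, under the same assumed `clf`, `X_train`, and `X_test`.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["no disease", "disease"],
    mode="classification",  # uses predict_proba under the hood
)
explanation = explainer.explain_instance(
    X_test.iloc[0].values, clf.predict_proba, num_features=5
)
print(explanation.as_list())
```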

Other Explainer Methods

The following explainer methods are illustrated on a second binary classification problem: predicting employee attrition.

Problem Statement: The problem at hand is to accurately predict whether an employee is likely to leave an organization. This prediction can be made by analyzing various factors related to the employee's demographics, job satisfaction, and work environment.

Solution: Develop a Binary Classification model to accurately predict employees who are at risk of attrition so that proactive measures can be taken to retain them.

Training Data Details

num_production_digital_project_changes_last_12_months: Number of changes made to a production-level digital project within the last 12 months.
pct_time_non_revenue_last_12_months: Percentage of time the employee spent on non-revenue-generating activities (e.g., administrative tasks, meetings) in the past 12 months.
emp_experience_diff_average_team_leadership_experience_last_9_months: Difference between the employee's experience and the average leadership experience of their team over the past 9 months.
num_promotions_in_past_2_years: Number of promotions the employee has received in the past 2 years.
emp_experience_diff_average_team_experience_3_vs_9_months: Difference between the employee's experience and the average team experience at two time points: 3 and 9 months ago.
num_production_project_changes_last_6_months: Number of changes or updates made to production projects in the last 6 months.
num_production_project_changes_last_9_months: Number of changes or updates made to production projects in the last 9 months.
education_level_bachelors: Indicates whether the employee has a bachelor's degree.
education_level_masters: Indicates whether the employee has a master's degree.
is_attrited: Denotes whether the employee has left the organization (prediction target).

SHAP (SHapley Additive exPlanations)

SHAP analysis assigns high importance to bench time, average cohort rating, education level, and number of projects changed. This indicates that these features make the largest marginal contributions to the model's predictions. In other words, variations in these features most strongly shift the model's output for a specific prediction.

[Figure: SHAP feature importance for the attrition model]

Partial Dependence Variance

Partial dependence variance analysis identifies high importance for bench time, average cohort rating, education level, and number of projects changed. This means changes in these features lead to significant variation in the model's average prediction. In simpler terms, each of these features independently has a strong effect on the model's output.

[Figure: Partial dependence variance for the attrition model]

Anchor

Anchor analysis identifies a rule, i.e., a set of feature conditions, that locally "anchors" a model prediction: as long as the rule holds, the prediction is almost always the same. For the explained instance, the anchor is bench time exceeding 44.44 in the last 9 months combined with a non-positive change in average team leadership rating between month 3 and month 2.

[Figure: Anchor explanation for the attrition model]
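One common implementation of anchors is `AnchorTabular` from the `alibi` library; a sketch under an assumed fitted attrition classifier `clf` and feature frames `X_train`/`X_test` (these names are illustrative).

```python
from alibi.explainers import AnchorTabular

explainer = AnchorTabular(clf.predict, feature_names=list(X_train.columns))
explainer.fit(X_train.values)

# An anchor is a set of feature conditions under which the model's
# prediction stays (almost) always the same.
explanation = explainer.explain(X_test.iloc[0].values, threshold=0.95)
print("Anchor:", " AND ".join(explanation.anchor))
print(f"Precision: {explanation.precision:.2f}, coverage: {explanation.coverage:.2f}")
```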

Time Series Forecasting Model

Problem Statement & Solution

Problem Statement: Insufficient inventory can result in lost sales due to unavailability of products. Overstocking can tie up capital and increase holding costs.

Solution: Develop a time series forecasting model to accurately predict weekly sales. Accurate sales forecasting is essential for effective business operations, financial stability, and customer satisfaction; failure to forecast sales can lead to inventory issues, financial difficulties, operational inefficiencies, missed market opportunities, and reduced customer satisfaction.

Training Data Details

Store: Store number
Dept: Department number
IsHoliday: Whether the week is a special holiday week
Type: Store type (A, B, or C), assigned by size; almost half of the stores are larger than 150,000 and categorized as A
Size: Store size
Temperature: Average temperature in the region
Fuel_Price: Cost of fuel in the region
MarkDown1-MarkDown5: Anonymized data related to promotional markdowns that Walmart is running
CPI: The consumer price index
Unemployment: The unemployment rate
Day, Week, Month, Quarter, Year: Calendar components of the observation date
Weekly_Sales: Sales for the given department in the given store (prediction target)

Explanation:

LIME: The LIME analysis identifies store size, holiday weeks, regional fuel price, month, and quarter as the factors most pivotal to the model's decision-making when forecasting weekly sales from the historical time series data.

[Figure: LIME explanation for the weekly sales forecast]
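Because the forecaster consumes tabular features (store, department, markdowns, CPI, calendar fields), the same `LimeTabularExplainer` pattern applies. In this sketch, `forecaster`, `X_train`, and `X_test` are assumed stand-ins for the trained model and its engineered feature frames.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values, feature_names=list(X_train.columns), mode="regression"
)
explanation = explainer.explain_instance(
    X_test.iloc[0].values, forecaster.predict, num_features=6
)
print(explanation.as_list())  # e.g., Size, IsHoliday, Fuel_Price, Month, Quarter
```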

Chain of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Chain-of-thought reasoning mirrors human reasoning. It facilitates systematic problem-solving by breaking a complex task down into a coherent series of logical deductions.

Explanation:

Prompt: What is the largest river in India?

Chain of thought provides detailed, step-by-step reasoning for the LLM's response to the above prompt.

[Figure: Chain-of-thought reasoning for the prompt]
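A minimal sketch of zero-shot chain-of-thought prompting; `llm` stands for any caller-supplied function that sends a prompt to the model and returns its text response.

```python
def chain_of_thought(question: str, llm) -> str:
    # The "step by step" suffix is the standard zero-shot CoT trigger that
    # elicits intermediate reasoning before the final answer.
    return llm(f"{question}\nLet's think step by step, then give the final answer.")

# Example: chain_of_thought("What is the largest river in India?", llm)
```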

Thread of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Thread-of-thought reasoning mirrors human reasoning. It works through long or unstructured contexts in manageable parts, summarizing and analyzing each part step by step before producing an answer.

Explanation:

Prompt: What is the largest river in India?

Thread of thought provides detailed, step-by-step reasoning for the LLM's response to the above prompt.

[Figure: Thread-of-thought reasoning for the prompt]
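A sketch of the thread-of-thought trigger phrase (as proposed by Zhou et al., 2023); `llm` is a caller-supplied model call as before.

```python
def thread_of_thought(context: str, question: str, llm) -> str:
    # ThoT asks the model to traverse a long or chaotic context in parts,
    # summarizing and analyzing each part before answering.
    return llm(
        f"{context}\nQ: {question}\n"
        "Walk me through this context in manageable parts step by step, "
        "summarizing and analyzing as we go."
    )
```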

ReRead Reasoning

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: ReRead (RE2) reasoning mirrors human re-reading. Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), which aim to elicit the reasoning process in the output, RE2 shifts the focus to the input by processing the question twice, thereby enhancing understanding. Consequently, RE2 demonstrates strong generality and compatibility with most thought-eliciting prompting methods.

Explanation:

Prompt: What is the largest river in India?

ReRead reasoning provides detailed, step-by-step reasoning for the LLM's response to the above prompt after re-reading the question.

[Figure: ReRead reasoning for the prompt]
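A sketch of the RE2 prompt, which simply repeats the question so the model processes the input twice; it composes naturally with output-side triggers such as chain of thought.

```python
def reread(question: str, llm) -> str:
    # RE2 shifts the focus to the input: the question is read twice.
    return llm(f"{question}\nRead the question again: {question}")
```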

Graph of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Graph-of-thought reasoning mirrors human reasoning. It facilitates systematic problem-solving by modeling intermediate thoughts as nodes in a graph, so reasoning paths can branch and merge rather than follow a single chain.

Explanation:

Prompt: What is the largest river in India?

Graph of thought provides detailed, step-by-step reasoning for the LLM's response to the above prompt.

[Figure: Graph-of-thought reasoning for the prompt]
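A toy sketch of the graph structure itself: unlike a chain, thoughts can branch from the question and later merge, which is what distinguishes a graph of thoughts from a chain or tree. The node texts are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Thought:
    text: str
    parents: list["Thought"] = field(default_factory=list)

question = Thought("What is the largest river in India?")
branch_a = Thought("Interpret 'largest' as the longest river.", parents=[question])
branch_b = Thought("Interpret 'largest' as the greatest discharge.", parents=[question])
# Merging two branches into one node: impossible in a chain, natural in a graph.
answer = Thought("Aggregate both interpretations before answering.",
                 parents=[branch_a, branch_b])
```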

Chain of verification

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Chain of verification helps users understand an LLM response by checking the baseline answer against a series of verification questions and answers.

Verification:

Prompt: What is the largest river in India?

Chain of verification asks the LLM five verification questions grounded in the context of the original reasoning and, based on the five answers, derives the final verified answer.

[Figure: Chain-of-verification output for the prompt]
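A sketch of the chain-of-verification loop described above, with `llm` again standing for a caller-supplied model call; the five-question count mirrors the description.

```python
def chain_of_verification(question: str, llm, n_questions: int = 5) -> str:
    baseline = llm(question)
    # 1. Plan verification questions that probe the baseline answer.
    plan = llm(
        f"Question: {question}\nBaseline answer: {baseline}\n"
        f"Write {n_questions} verification questions that check this answer."
    )
    questions = [q for q in plan.splitlines() if q.strip()][:n_questions]
    # 2. Answer each verification question independently of the baseline.
    answers = [llm(q) for q in questions]
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    # 3. Derive the final answer from the verification evidence.
    return llm(
        f"Question: {question}\nBaseline answer: {baseline}\n"
        f"Verification Q&A:\n{qa}\nGive the final, corrected answer."
    )
```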

Token Importance

Problem Statement & Solution

Problem Statement: In AI language models, the importance of tokens can significantly influence the generated responses. Understanding which tokens (words or phrases) are most impactful can be crucial for interpreting and trusting the model's decisions. This is particularly relevant in applications where precise and reliable outputs are essential, such as in healthcare, finance, and legal domains.

Solution: Token Importance helps in understanding how different tokens contribute to the AI model's responses. By analyzing the relative importance of tokens, users can gain insight into which parts of the input significantly affect the model’s output. This can enhance transparency and trust in the AI system's decision-making process.

Explanation:

Prompt: What is the largest river in India?

  1. Displays a matrix of the top 10 tokens and their importance scores

  2. The Token Importance Distribution Chart illustrates the significance of individual tokens by displaying the distribution of their associated impact scores. The chart's shape reveals the following insights:

    • Flat Distribution: Tokens have similar importance, with no clear standout

    • Left-Peaked Distribution: Tokens have low impact scores, indicating lesser importance

    • Right-Peaked Distribution: Tokens have high impact scores, signifying greater importance

  3. Displays the importance of each token (the top 10 tokens by importance are shown in this chart).

[Figure: Token importance matrix, distribution chart, and per-token importance]
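One common way to estimate token importance is occlusion (leave-one-token-out). The sketch below assumes a hypothetical `score_response` function that returns the model's confidence for a given prompt; it illustrates the idea rather than the exact method behind the charts above.

```python
def token_importance(prompt: str, score_response, top_k: int = 10):
    tokens = prompt.split()
    base = score_response(prompt)  # hypothetical confidence for the full prompt
    scores = {}
    for i, tok in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        # A large drop when the token is removed means the token mattered.
        scores[tok] = base - score_response(ablated)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Example: token_importance("What is the largest river in India?", score_response)
```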

Search Augmentation

Problem Statement & Solution

Problem Statement: When users cannot discern how AI systems reach their conclusions, it can undermine trust in the technology. This issue is particularly pressing in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences. Ensuring the accuracy and transparency of AI responses is crucial for maintaining user confidence.

Solution: The Chain of Verification through Internet Search offers a systematic approach to validate the accuracy of an AI response by cross-referencing it with multiple reliable online sources. This method involves querying different authoritative sources to confirm the correctness of the AI's answer and provide clarity on how the response was derived.

Explanation:

Prompt: What is the largest river in India?

  1. Internet Search displays the final response after cross-validating the facts generated by thread of thought against internet search results.

  2. Lists the facts used by thread of thought while reasoning about the LLM response.

  3. Provides an explanation based on the internet search results.

  4. Gives a judgement on the accuracy of the LLM response. The validation process compares the LLM's thread of thought against internet search results:

    • No: Internet search results contradict the LLM's facts, indicating potential inaccuracies.

    • Yes: Internet search results support the LLM's facts, confirming their validity.

    • Unclear: Internet search results lack sufficient information to determine the accuracy of the LLM's response, requiring further investigation.

[Figure: Internet search validation of the LLM response]
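A sketch of this validation loop, with hypothetical `llm` and `search` callables standing in for the model client and the web search API.

```python
def validate_with_search(facts, llm, search) -> str:
    # For each fact from the thread of thought, retrieve evidence and ask
    # for a Yes / No / Unclear judgement, then aggregate to a final verdict.
    verdicts = []
    for fact in facts:
        evidence = search(fact)
        verdicts.append(llm(
            f"Fact: {fact}\nSearch evidence: {evidence}\n"
            "Does the evidence support the fact? Answer Yes, No, or Unclear."
        ).strip())
    if any(v.startswith("No") for v in verdicts):
        return "No"       # search results contradict at least one fact
    if any(v.startswith("Unclear") for v in verdicts):
        return "Unclear"  # insufficient information; needs investigation
    return "Yes"          # all facts supported
```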

Logic of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: The logic-of-thought technique encompasses a variety of methodologies for understanding and analyzing human thinking and reasoning. In essence, it applies formalized techniques for logical reasoning (such as deductive reasoning and critical thinking). The purpose of these techniques is to enhance clarity of thought, ensure decisions are based on sound logic, and improve the cognitive processes we use to navigate the world.

Explanation:

Prompt: Which is the largest river in India?

The Logic of Thought (LoT) extracts propositions and logical expressions, extending them to generate expanded logical information from the input context. This generated logical information is then utilized as an additional augmentation to the input prompts, thereby enhancing the system's logical reasoning capabilities.

[Figure: Logic-of-thought output for the prompt]
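A simplified sketch of the LoT expansion step, applying only the law of contraposition to illustrative extracted implications before injecting them back into the prompt; full LoT applies a wider set of logical laws.

```python
def logic_of_thought(question: str, implications, llm) -> str:
    # Each implication is a (premise, conclusion) pair extracted from context.
    expanded = list(implications)
    for p, q in implications:
        expanded.append((f"NOT ({q})", f"NOT ({p})"))  # contrapositive
    logic_context = "\n".join(f"IF {p} THEN {q}" for p, q in expanded)
    # Augment the prompt with the expanded logical information.
    return llm(f"{question}\nLogical information:\n{logic_context}")
```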

Evaluation Metrics

Problem Statement & Solution

Problem Statement: Defining "correct" or "good" explanations can be ambiguous and context-dependent. There are no standardized, universally accepted metrics to objectively quantify the quality of explanations generated by LLMs, their overall performance, or the user's satisfaction with the interaction.

Solution: Recent research on LLM explanation, QUEST (Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence), proposes evaluating LLM explanations with metrics such as uncertainty, relevance, coherence, language tone, and sentiment. With the help of prompt engineering, we ask the LLM to score these metrics and to explain each score.

Metrics:

Prompt: What is the largest river in India?

Uncertainty quantification and coherence score are the two evaluation metrics we have implemented to quantify the quality of the explanations generated by the LLM. A high coherence score shows that the answer is logically aligned with the actual query, while a low uncertainty score indicates that the LLM has high confidence in its answer.

Coherence
  Less Coherent: >=0 and <=30
  Moderately Coherent: >30 and <=70
  Highly Coherent: >70 and <=100

Certainty
  Highly Certain: >=0 and <=30 (less uncertainty)
  Moderately Certain: >30 and <=70 (moderately uncertain)
  Less Certain: >70 and <=100 (highly uncertain)
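
These bands can be applied programmatically; a minimal sketch.

```python
def coherence_band(score: float) -> str:
    # Maps a 0-100 coherence score onto the bands listed above.
    if score <= 30:
        return "Less Coherent"
    return "Moderately Coherent" if score <= 70 else "Highly Coherent"

def certainty_band(uncertainty: float) -> str:
    # Lower uncertainty means higher model confidence in the answer.
    if uncertainty <= 30:
        return "Highly Certain"
    return "Moderately Certain" if uncertainty <= 70 else "Less Certain"

print(coherence_band(82), "/", certainty_band(12))  # Highly Coherent / Highly Certain
```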

[Figure: Coherence and certainty evaluation for the prompt]

Note: Few examples, datasets and graphs used in this section are sourced from publicly available information and are attributed to their respective creators.